Alajuela
MessIRve: A Large-Scale Spanish Information Retrieval Dataset
Valentini, Francisco, Cotik, Viviana, Furman, Damián, Bercovich, Ivan, Altszyler, Edgar, Pérez, Juan Manuel
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
- North America > United States > California > Santa Barbara County > Santa Barbara (0.14)
- North America > Mexico (0.04)
- South America > Colombia > Bogotá D.C. > Bogotá (0.04)
A Comprehensive Survey on Heart Sound Analysis in the Deep Learning Era
Ren, Zhao, Chang, Yi, Nguyen, Thanh Tam, Tan, Yang, Qian, Kun, Schuller, Björn W.
Heart sound auscultation has been shown to be beneficial in clinical practice for the early screening of cardiovascular diseases. Because auscultation requires well-trained professionals, automatic auscultation based on signal processing and machine learning can assist diagnosis and reduce the burden of training professional clinicians. Nevertheless, classic machine learning offers limited performance improvement in the era of big data. Deep learning has achieved better performance than classic machine learning in many research fields, as it employs more complex model architectures with a stronger capability of extracting effective representations. Deep learning has been successfully applied to heart sound analysis in recent years. As most reviews of heart sound analysis were published before 2017, the present survey is the first comprehensive overview to summarise papers on heart sound analysis with deep learning over the six years 2017–2022. We introduce both classic machine learning and deep learning for comparison, and further offer insights into the advances and future research directions in deep learning for heart sound analysis.
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Europe > Portugal > Coimbra > Coimbra (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Overview (1.00)
- Research Report > New Finding (0.46)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
- Education (1.00)
- Health & Medicine > Health Care Providers & Services (0.92)
SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets
Yuan, Ann, Ippolito, Daphne, Nikolaev, Vitaly, Callison-Burch, Chris, Coenen, Andy, Gehrmann, Sebastian
NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web such as WikiBio are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio, a new evaluation set for WikiBio, composed of structured attribute lists describing fictional individuals, mapped to natural language biographies. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.28)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Leisure & Entertainment > Sports (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
Climate-driven statistical models as effective predictors of local dengue incidence in Costa Rica: A Generalized Additive Model and Random Forest approach
Vásquez, Paola, Loría, Antonio, Sanchez, Fabio, Barboza, Luis A.
Climate has been an important factor in shaping the distribution and incidence of dengue cases in tropical and subtropical countries. In Costa Rica, a tropical country with distinctive micro-climates, dengue has been endemic since its introduction in 1993, inflicting substantial economic, social, and public health repercussions. Using the number of reported dengue cases and climate data from 2007–2017, we fitted a prediction model applying a Generalized Additive Model (GAM) and Random Forest (RF) approach, which allowed us to retrospectively predict dengue occurrence in five climatologically diverse municipalities around the country.
- Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
- Asia > China > Guangdong Province (0.14)
- Africa > Liberia (0.06)